In this paper, we give a thorough overview of the optimizers used in machine learning. Specifically, PyTorch models will be used to illustrate the code implementations of these optimizers. We also discuss some recent optimizer developments related to large language models.
Heuristics for optimization
Machine learning training has its roots in convex optimization problems. In deep learning, we usually deal with tensor operations, so the optimization involves multivariate calculus, and the derivatives take the form of vectors and matrices. To build up the concepts, we start with 1D optimization.
1D optimization
Suppose we have a nonlinear equation to solve: \[ f(x) = 0 \]
Now we need to find the solution. We can approximate the function locally by its tangent line and iterate with Newton's method, which is given by:
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)},\quad f'(x_n) \neq 0 \]
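A minimal sketch of this iteration in Python follows; the function, starting point, and tolerance are illustrative choices, not from any particular library.

def newton_root(f, df, x0, tol=1e-10, max_iter=100):
    """Find a root of f via Newton's method, assuming df(x) != 0 along the way."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)   # the Newton update f(x_n) / f'(x_n)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: solve x^2 - 2 = 0, i.e. approximate sqrt(2).
print(newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0))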
After a few iterations, we get close to the solution. For a convex optimization problem, a local minimum has the properties \(f'(x) = 0\) and \(f''(x) > 0\). Thus, we can apply Newton's method to the derivative to solve the optimization problem:
\[ x_{n+1} = x_n - \alpha\frac{f'(x_n)}{f''(x_n)} \]
Here, we have two important concepts. The iterates define a path \(\{x_0, x_1, \ldots\}\). The magnitude of the step from \(x_n\) to \(x_{n+1}\) is given by the value of
\[ \left|\alpha \frac{f'(x)}{f''(x)}\right|. \]
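A minimal sketch of this damped Newton iteration follows; the objective and the choice of \(\alpha\) are illustrative assumptions.

def newton_minimize_1d(df, d2f, x0, alpha=1.0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - alpha * f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = alpha * df(x) / d2f(x)   # |step| is the step magnitude above
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose minimum is at x = 3.
print(newton_minimize_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0))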
Tensor optimization
In the case of tensors, the variables become vectors and matrices, and we work with the gradient and the Hessian matrix: the gradient is the first-order derivative and the Hessian matrix is the second-order derivative. Suppose that we need to solve the following optimization problem:
\[ \min_{\boldsymbol\theta} F(\boldsymbol\theta). \]
Then, the gradient is given by:
\[ \nabla_\boldsymbol{\theta} F(\boldsymbol\theta) = \left[\frac{\partial F(\boldsymbol\theta)}{\partial \theta_1},\frac{\partial F(\boldsymbol\theta)}{\partial \theta_2},...,\frac{\partial F(\boldsymbol\theta)}{\partial \theta_n}\right]. \]
If \(\boldsymbol\theta\) is a matrix of size \(m\times n\), then the gradient is also a matrix of size \(m\times n\), and we have
\[ \nabla_\boldsymbol{\theta} F(\boldsymbol\theta) = \begin{bmatrix} \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{11}} & \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{12}} & \cdots & \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{1n}}\\ \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{21}} & \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{22}} & \cdots & \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{2n}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{m1}} & \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{m2}} & \cdots & \frac{\partial F(\boldsymbol\theta)}{\partial \theta_{mn}} \end{bmatrix}. \]
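As a quick sanity check in PyTorch, the gradient produced by autograd has the same shape as the parameter; the scalar function \(F\) below is an illustrative choice.

import torch

theta = torch.randn(3, 4, requires_grad=True)  # m = 3, n = 4
F = (theta ** 2).sum()                         # a scalar function of theta
F.backward()
print(theta.grad.shape)                        # torch.Size([3, 4])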
After we construct the loss function, we can begin to search for the optimal solution. The most popular optimizer is gradient descent, where at each step the parameters move against the gradient, scaled by a learning rate \(\alpha\):
\[ \boldsymbol\theta_{n+1} = \boldsymbol\theta_n - \alpha \nabla_\boldsymbol{\theta} F(\boldsymbol\theta_n). \]
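A minimal gradient-descent loop in PyTorch might look as follows; the quadratic objective and the value of the learning rate are illustrative assumptions.

import torch

theta = torch.tensor([2.0, -1.0], requires_grad=True)
alpha = 0.1                          # learning rate
for _ in range(100):
    loss = (theta ** 2).sum()        # F(theta) = ||theta||^2, minimized at 0
    loss.backward()                  # autograd computes the gradient
    with torch.no_grad():
        theta -= alpha * theta.grad  # theta_{n+1} = theta_n - alpha * grad
    theta.grad.zero_()
print(theta)                         # close to [0., 0.]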
Note that gradient descent uses only the first-order derivative; it does not originate directly from Newton's method. Gradient descent is a first-order optimization method, whereas Newton's method is a second-order method, given by:
\[ \boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_n-\alpha\left[\nabla^2 F\left(\boldsymbol{\theta}_n\right)\right]^{-1} \nabla F\left(\boldsymbol{\theta}_n\right). \]
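A single Newton step can be sketched with PyTorch's functional autograd helpers; the quadratic objective below is an illustrative assumption (a quadratic is minimized in one full step with \(\alpha = 1\)).

import torch

def F(theta):
    A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
    return 0.5 * theta @ A @ theta

theta = torch.tensor([1.0, -1.0])
g = torch.autograd.functional.jacobian(F, theta)  # gradient of F
H = torch.autograd.functional.hessian(F, theta)   # Hessian of F
theta = theta - torch.linalg.solve(H, g)          # one Newton step, alpha = 1
print(theta)                                      # tensor([0., 0.])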
Newton's method usually converges faster, but each step is more expensive to compute, so there is a tradeoff. We can also consider quasi-Newton methods, which approximate the (inverse) Hessian rather than computing it exactly; well-known algorithms include BFGS (Broyden-Fletcher-Goldfarb-Shanno) and its limited-memory variant L-BFGS.
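PyTorch ships a limited-memory quasi-Newton optimizer, torch.optim.LBFGS, which requires a closure that re-evaluates the loss; the objective here is an illustrative assumption.

import torch

theta = torch.tensor([2.0, -1.0], requires_grad=True)
opt = torch.optim.LBFGS([theta], lr=1.0)

def closure():
    opt.zero_grad()
    loss = ((theta - 1.0) ** 2).sum()  # minimized at theta = [1, 1]
    loss.backward()
    return loss

for _ in range(5):
    opt.step(closure)
print(theta)  # close to [1., 1.]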
The Adam optimizer
The Adam optimizer is built on a few important concepts, which we introduce first:
- First moment: the mean of the gradients. Instead of using only the current gradient, we use the mean of the gradients, computed as an exponential moving average;
- Second moment: the (uncentered) variance of the gradients, computed as an exponential moving average of the squared gradients. It captures how variable the gradients are and is used to adapt the learning rate for each parameter, so that parameters with large gradients take smaller updates and vice versa. This adaptive learning rate mechanism helps stabilize optimization, especially with noisy gradients or sparse data.
One step of the Adam optimizer consists of the following (a from-scratch sketch follows the hyperparameter list below):

1. Compute the gradient \(g_t\); PyTorch autograd already does this for us;
2. Update the first moment: \[ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \]
3. Update the second moment: \[ v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \]
4. Apply the bias corrections: \[ \hat{m}_t = \frac{m_t}{1-\beta_1^t},\quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \]
5. Update the parameters: \[ \theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]
The hyperparameters are:
- \(\alpha\): the learning rate;
- \(\beta_1\): the decay rate for the first moment;
- \(\beta_2\): the decay rate for the second moment;
- \(\epsilon\): the small value to avoid the division by zero.
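Here is a from-scratch sketch of a single Adam step following the recipe above; the NumPy setup and the default hyperparameter values mirror common choices and are otherwise illustrative.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad        # first moment update
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment update
    m_hat = m / (1.0 - beta1 ** t)              # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step starting from zero-initialized moments (t starts at 1).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta, m, v = adam_step(theta, grad=2.0 * theta, m=m, v=v, t=1)
print(theta)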
PyTorch implementation
Autograd in PyTorch
To understand how the PyTorch optimizers work, we should know that PyTorch uses automatic differentiation, implemented via backpropagation, to compute gradients. Thus, we don't need to derive the gradients analytically. Note that all the optimizers rely on this technique. A very simple example would be:
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
z = y * y * 3            # z_i = 12 * x_i^2
out = z.mean()
out.backward()           # backpropagate from the scalar output
print(x.grad)            # d(out)/dx_i = 8 * x_i -> tensor([ 8., 16., 24.])
The AdamW optimizer in PyTorch
AdamW and Adam differ in their weight decay procedure: AdamW decouples the weight decay from the gradient-based update, so the decay is not scaled by the adaptive learning rate.
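In code, both optimizers accept a weight_decay argument but treat it differently: torch.optim.Adam folds \(\lambda\theta\) into the gradient (classic L2 regularization), while torch.optim.AdamW shrinks the weights directly before the adaptive update. The model below is an illustrative placeholder.

import torch

model = torch.nn.Linear(10, 1)  # illustrative model
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)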
Referring to the official documentation, the update rules to be implemented are:
\[\begin{aligned} &\rule{180mm}{0.4pt} \\ &\textbf{input} : \gamma \text{(lr)}, \: \beta_1, \beta_2 \text{(betas)}, \: \theta_0 \text{(params)}, \: f(\theta) \text{(objective)}, \: \epsilon \text{ (epsilon)} \\ &\hspace{13mm} \lambda \text{(weight decay)} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)}, v_0 \leftarrow 0 \text{ ( second moment)} \\[-1.ex] &\rule{180mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}\text{Get a minibatch and calculate the loss $f_t(\theta_{t-1})$} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \quad \textit{weight decay} \\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\ &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{180mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{180mm}{0.4pt} \\[-1.ex] \end{aligned}\]

The first moment behaves like momentum in physics if we look at the directional arrows between successive solution points: it provides both the direction and the magnitude of the update. The second moment tracks the (uncentered) variance of the gradients and is used to adjust the learning rate for each parameter: when the gradient variance suddenly becomes large, the effective learning rate shrinks, and vice versa. The \(\epsilon\) term avoids division by zero.
Bias corrections are needed for \(m_t\) and \(v_t\) in the Adam optimizer because both are initialized at 0, and the exponential moving averages are therefore biased towards 0 at the start of training, making the estimates too small early in optimization. For example, at \(t=1\) with \(\beta_1 = 0.9\) we have \(m_1 = 0.1\,g_1\), and dividing by \(1-\beta_1^1 = 0.1\) recovers \(\hat{m}_1 = g_1\). Note that \(\beta_1\) and \(\beta_2\) are usually chosen very close to 1; in PyTorch the default values are 0.9 and 0.999, and \(\epsilon\) is usually set to a small value like \(10^{-8}\).
The associated code implementation, without any special setup, is:
from typing import List, Optional, Union

import torch
from torch import Tensor
# _get_value and _dispatch_sqrt are internal helpers defined in
# torch/optim/optimizer.py in the PyTorch version this snippet is taken from.
from torch.optim.optimizer import _dispatch_sqrt, _get_value


def _single_tensor_adamw(
    params: List[Tensor],
    grads: List[Tensor],
    exp_avgs: List[Tensor],
    exp_avg_sqs: List[Tensor],
    max_exp_avg_sqs: List[Tensor],
    state_steps: List[Tensor],
    grad_scale: Optional[Tensor],
    found_inf: Optional[Tensor],
    *,
    amsgrad: bool,
    beta1: float,
    beta2: float,
    lr: Union[Tensor, float],
    weight_decay: float,
    eps: float,
    maximize: bool,
):
    assert grad_scale is None and found_inf is None

    if torch.jit.is_scripting():
        # this assert is due to JIT being dumb and not realizing that the ops below
        # have overloads to handle both float and Tensor lrs, so we just assert it's
        # a float since most people using JIT are using floats
        assert isinstance(lr, float)

    for i, param in enumerate(params):
        grad = grads[i] if not maximize else -grads[i]
        exp_avg = exp_avgs[i]
        exp_avg_sq = exp_avg_sqs[i]
        step_t = state_steps[i]

        # update step
        step_t += 1

        # Perform stepweight decay: theta <- theta * (1 - lr * lambda)
        param.mul_(1 - lr * weight_decay)

        # Decay the first and second moment running average coefficient
        exp_avg.lerp_(grad, 1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

        step = _get_value(step_t)

        bias_correction1 = 1 - beta1 ** step
        bias_correction2 = 1 - beta2 ** step

        step_size = lr / bias_correction1
        bias_correction2_sqrt = _dispatch_sqrt(bias_correction2)

        if amsgrad:
            # Maintains the maximum of all 2nd moment running avg. till now
            torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])

            # Use the max. for normalizing running avg. of gradient
            denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
        else:
            denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

        # theta <- theta - step_size * m_hat / (sqrt(v_hat) + eps)
        param.addcdiv_(exp_avg, denom, value=-step_size)
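For completeness, typical usage of the built-in optimizer that dispatches to the routine above might look like this; the model, data, and hyperparameters are illustrative placeholders.

import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
                        betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()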